1992-04-17
▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
▒ ▒
▒ BANKER (ver 1.01) :: The G-Bank Maker ▒
▒ ▒
▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒
I. What is BANKER?
II. How to use BANKER?
III. How to format input data for BANKER?
IV. Notes for programmers
I. What is BANKER?
BANKER is a G data bank maker. It creates G data banks without going
through G. It is particularly suited to making large G data banks, since
a BANKER-generated bank is NOT bound by the 2500-series maximum that
currently applies to any G-generated bank. In other words, with BANKER
your G bank can include millions of series, and the size of your hard
disk becomes the effective constraint.
BANKER makes both hashed (.hbk, .hin) and compressed (.cbk, .cin) G
banks. Users are strongly encouraged to produce the hashed bank, however,
as it is now the standard data bank that I.E.R.F. regularly creates and
maintains.
II. Usage:
Syntax: BANKER [option] <data> [bank]
Options are one or more of the following option characters preceded by "-"
or "/". If no option is specified, a hashed bank is produced.
BANK TYPE OPTIONS              OTHER OPTIONS
-h  hashed bank (default)      -b<###>  set the number of BINs for a hashed bank
-c  compressed bank            -v       view series names while processing
                               -d       debugging
                               -l       look at the finished bank
                               -t[XXX]  take bank title
Examples:
BANKER lift.dat               :produces default bank (LIFT.HBK, LIFT.HIN)
BANKER -h lift.dat            :produces LIFT.HBK, LIFT.HIN
BANKER -b101 lift.dat         :makes the hashed bank with # of BINs = 101
BANKER -d lift.dat            :make hashed bank, debugging only
BANKER -c lift.dat            :produces LIFT.CBK, LIFT.CIN
BANKER -cv lift.dat           :and displays series names while processing
BANKER -cvtinforum lift.dat   :and takes the bank_title as 'inforum'
BANKER -ctinforum-d lift.dat  :debugging, and bank_title = 'inforum'
BANKER -cdtinforum lift.dat   :debugging, and bank_title = 'inforum'
BANKER -cd lift.dat           :make compressed bank, debugging only
Note that normally only the very first option switch must be preceded
by '-' or '/'. There is one caveat with the -t switch: if a bank title is
attached to it, the next option must also be preceded by '-' or '/', which
would otherwise be optional.
Besides, if your bank title consists of more than one word, connect the
words with '+' so that no spaces separate them.
In light of this, you may decide to omit the -t switch altogether, in which
case you will be prompted for a bank title once BANKER starts running. Yet
another way to supply the bank title is to enclose it in braces '{ }' and
put it at the very beginning of your formatted data file. In effect, braces
'{ }' mark comments to BANKER.
A file named BANKER.ERR is automatically produced by each run of
BANKER. It reports the compression status of each series, along with any
other related messages. The most common messages on data compression are:
(1) "Gave up compression xxx: difference larger than 32767 or all-zero series";
(2) "Gave up compression xxx: difference larger than LONG_MAX"
The first simply means that an all-zero series was encountered, or that the
largest first difference among the observations of the series is greater than
32767. Note that if the observations have decimal places (floating-point
numbers), the first difference is taken after the decimal point has been
slid to the right past all the decimal places. This also explains the
second message: even though the largest first difference of your data
before the decimal point is slid is no greater than LONG_MAX
(2,147,483,647), after the sliding (especially when the numbers have many
decimal places) the largest first difference may well exceed LONG_MAX.
III. How to format data file for BANKER
BANKER handles all conventional G DATA and MATDAT input formats,
which are assumed to be known to the G user. Detailed discussions appear
in the G "help" files and in the Appendix of Clopper Almon's "The Craft of
Economic Modeling".
BANKER also accepts the INFORUM one-series-per-line format, which
was initially developed to process very large data banks more efficiently,
such as the World Tables of Social and Economic Indicators. The general
idea of this format is to stock all the information about a series on one
line (record), including, of course, all its observations.
Here is an example of the one-series-per-line format:
▒ gdp$ 1985 Q 1 0 5025 5130.5 5.255e3 .. .. ▒
▒ ▒
▒ where, ▒
▒ gdp$ : series_name ▒
▒ 1985 : baseyear (can also be written as 85) ▒
▒ Q : frequency (M = Monthly, A = Annual) ▒
▒ 1 : starting_period ▒
▒ 0 : decimal point left-shift factor ▒
▒ 5025 : observation #1 ▒
▒ 5130.5 : observation #2 ▒
▒ 5.255e3 : observation #3 ▒
▒ .. : .. ▒
All seems clear except, perhaps, the fifth item of this format, the
decimal point left-shift factor, so some explanation (or justification)
is in order. Simply put, the left-shift factor lets you add decimal
places to your data. For example, BLS data often require additional
decimal places, and the left-shift factor provides them. If, on the other
hand, there is no need to modify the decimal places in your data, use zero
as the left-shift factor.
Please note that items must be separated by at least one space
(as shown in the example above).
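As an illustration of the one-series-per-line layout, here is a minimal C sketch that splits one record into its five header items and its observations. It is not BANKER's own parser: the struct `series_record`, the function `parse_series_line`, and the cap `MAX_OBS` are hypothetical names, and the left-shift factor is only stored, not applied, because its exact arithmetic is not spelled out here.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_OBS 64               /* arbitrary cap for this sketch */

struct series_record {           /* hypothetical container */
    char   name[32];
    int    base_year;            /* stored as a four-digit year */
    char   frequency;            /* 'A', 'Q' or 'M' */
    int    start_period;
    int    left_shift;           /* kept as read, not applied */
    double obs[MAX_OBS];
    int    nobs;
};

/* Parse one record. Returns 0 on success, -1 if one of the five header
   items is missing. The line is modified in place by strtok. */
int parse_series_line(char *line, struct series_record *r)
{
    const char *sep = " \t\r\n";
    char *tok = strtok(line, sep);
    if (!tok) return -1;
    strncpy(r->name, tok, sizeof r->name - 1);
    r->name[sizeof r->name - 1] = '\0';   /* case is left untouched */

    if (!(tok = strtok(NULL, sep))) return -1;
    r->base_year = atoi(tok);
    if (r->base_year < 100) r->base_year += 1900;   /* 85 means 1985 */

    if (!(tok = strtok(NULL, sep))) return -1;
    r->frequency = tok[0];

    if (!(tok = strtok(NULL, sep))) return -1;
    r->start_period = atoi(tok);

    if (!(tok = strtok(NULL, sep))) return -1;
    r->left_shift = atoi(tok);

    r->nobs = 0;
    while (r->nobs < MAX_OBS && (tok = strtok(NULL, sep)) != NULL)
        r->obs[r->nobs++] = strtod(tok, NULL);   /* accepts 5.255e3 too */
    return 0;
}
```

Because the items are split on whitespace, any number of spaces between them is accepted, which matches the "at least one space" rule.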
It should be mentioned in passing that BANKER will keep intact the
series names exactly as in the formatted input file. That is, if a
series name (or part of it) is in upper (lower) case, it will remain that
way in the bank created by BANKER.
IV. Notes for programmers
As indicated earlier, BANKER can make both hashed and compressed G
banks. In the compressed bank, each series has its own starting date
and number of observations and most series have been compressed.
Compression involves these steps:
- find and record the number of decimal places in the series.
- slide the decimal to the right until the series is all integers.
- record the first observation as a 4-byte integer.
- record the first differences of the series as 2-byte integers.
- if a series cannot be accurately recorded in this compressed
form, it is declared to have 255 decimal places (just a flag), and
the observations recorded as four-byte floating point numbers.
- missing observations are marked with a special code.
- the data file has the extension "cbk"; and the index, "cin".
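The first four steps can be sketched in C as follows. This is an illustration under stated assumptions, not BANKER's source: the function names are hypothetical, decimal places are counted from plain text tokens (exponent notation is ignored), values are rounded to the nearest integer after scaling, and a difference of exactly 32767 is rejected because that value is treated here as reserved for marking. The code for missing observations is not specified in this list and is left out.

```c
#include <stddef.h>
#include <string.h>

/* Digits after the decimal point in a plain token such as "5130.5". */
int decimals_in_token(const char *tok)
{
    const char *dot = strchr(tok, '.');
    return dot ? (int)strlen(dot + 1) : 0;
}

/* Slide the decimal point right by multiplying, then round. */
static long long to_scaled(double x, double scale)
{
    double v = x * scale;
    return (long long)(v >= 0 ? v + 0.5 : v - 0.5);
}

/* Store the first observation as a 4-byte integer and the rest as
   2-byte first differences. Returns 1 on success, 0 if the series does
   not fit (it would then be written as floats and flagged with 255). */
int compress_series(const double *obs, size_t n, int decimals,
                    long *first_obs, short *diffs)
{
    double scale = 1.0;
    long long prev, cur, d;
    size_t i;
    int k;

    if (n == 0) return 0;
    for (k = 0; k < decimals; k++) scale *= 10.0;

    prev = to_scaled(obs[0], scale);
    if (prev > 2147483647LL || prev < -2147483647LL) return 0;
    *first_obs = (long)prev;

    for (i = 1; i < n; i++) {
        cur = to_scaled(obs[i], scale);
        d = cur - prev;
        if (d >= 32767 || d <= -32767) return 0;   /* too large for 2 bytes */
        diffs[i - 1] = (short)d;
        prev = cur;
    }
    return 1;
}
```

For the observations 5025, 5130.5 and 5255 with one decimal place, the scaled series is 50250, 51305, 52550, so the stored differences are 1055 and 1245.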
The .cin file contains:
item          size in bytes     C type
==============================================================
ns            2                 int
nc            2                 unsigned int
names         nc                char
Here ns is the number of series, and nc is the number of characters in
all the series names, counting the nulls at the end of each series name.
If there were three variables in the bank with names tom, dick, and harry,
ns would be 3, and the names vector would be
tom0dick0harry0
where 0 represents a null ('\0' in C), and nc would be 15, the number of
characters in the names vector, counting the nulls.
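A short C sketch of how such a names vector and its length nc could be assembled. The function name `build_names_vector` is hypothetical and the code only illustrates the layout described above.

```c
#include <stddef.h>
#include <string.h>

/* Concatenate ns names, each followed by its terminating null, into out.
   Returns nc, the byte count including the nulls, or 0 if out (capacity
   cap) is too small. */
size_t build_names_vector(const char *const *names, size_t ns,
                          char *out, size_t cap)
{
    size_t nc = 0;
    size_t i;
    for (i = 0; i < ns; i++) {
        size_t len = strlen(names[i]) + 1;   /* +1 keeps the '\0' */
        if (nc + len > cap) return 0;
        memcpy(out + nc, names[i], len);
        nc += len;
    }
    return nc;
}
```

Called with the names tom, dick and harry, it returns 15, matching the example.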
The .cbk file contains:
item          size in bytes     C type
==============================================================
title         80                char
ns            2                 int
position      4                 unsigned long
series 1      variable          see below
....          ........
series n      variable          see below
indx          4*ns              unsigned long
Here, ns is again the number of series; indx is an array containing the
byte numbers in this file at which the series begin. To continue the example,
suppose that the series "tom" requires 101 bytes, "dick" requires 121 bytes,
and "harry" requires 81. Then "indx" is the vector (86, 187, 308). (Remember
that in C a file starts with byte 0). The "position" is the byte number at
which this "indx" array begins. In the example, it will be 389.
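The arithmetic of this example can be checked with a few lines of C. The function name `layout_cbk` is hypothetical. The sketch assumes the fields are written back to back with no padding, so the first series starts at byte 86 (80 + 2 + 4).

```c
#include <stddef.h>

#define CBK_HEADER_BYTES 86UL   /* title (80) + ns (2) + position (4) */

/* Fill indx[i] with the starting byte of series i and return "position",
   the byte at which the indx array itself begins. */
unsigned long layout_cbk(const unsigned long *series_bytes, size_t ns,
                         unsigned long *indx)
{
    unsigned long offset = CBK_HEADER_BYTES;
    size_t i;
    for (i = 0; i < ns; i++) {
        indx[i] = offset;
        offset += series_bytes[i];
    }
    return offset;
}
```

With series sizes 101, 121 and 81, indx comes out as (86, 187, 308) and the returned position is 389.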
Each compressed series has the format:
item            size in bytes     C type
==============================================================
BaseYear        1                 unsigned char
FreqPeriod      1                 unsigned char
SlashDecplaces  1                 unsigned char
NDif            2                 int
FirstObs        4                 long
Differences     2*NDif            int
where
BaseYear = the year of the first observation, minus 1900.
FreqPeriod = 16*frequency+period, where
frequency is the number of observations per year (1, 4, or 12)
period = period of first observation. (For frequencies above 12,
set FreqPeriod = 255. This value signals that two integers have
been inserted following this byte containing the frequency and the
period.)
SlashDecplaces = 16*SlashFactor + Number of Decimal places
(SlashFactor is normally 0; the Press program, however, allows the
option of dividing by a power of 2 to reduce the magnitudes of a
series so that it can be compressed. The SlashFactor is the power
(1, 2, 3, etc.) used on this series. In Press, the default maximum
slash factor is 0, so the occurrence of non-zero slash factors is
unusual.) Usually, SlashDecplaces is just the number of decimal
places. If it is 255, the series has not been compressed.
NDif = the number of differences (= Number of observations - 1).
FirstObs = the first observation, as a four-byte integer.
Differences = the first differences of the series, as 2-byte integers.
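The three one-byte header fields can be packed and unpacked as in the C sketch below. The function names are hypothetical. The sketch covers only frequencies 1, 4 and 12; the FreqPeriod = 255 escape for higher frequencies and the SlashDecplaces = 255 "not compressed" flag are not handled.

```c
/* BaseYear: year of the first observation minus 1900. */
unsigned char pack_base_year(int year)
{
    return (unsigned char)(year - 1900);
}

/* FreqPeriod = 16*frequency + period. */
unsigned char pack_freq_period(unsigned frequency, unsigned period)
{
    return (unsigned char)(16u * frequency + period);
}

void unpack_freq_period(unsigned char b, unsigned *frequency,
                        unsigned *period)
{
    *frequency = b / 16u;
    *period    = b % 16u;
}

/* SlashDecplaces = 16*SlashFactor + number of decimal places. */
unsigned char pack_slash_decplaces(unsigned slash, unsigned decplaces)
{
    return (unsigned char)(16u * slash + decplaces);
}

void unpack_slash_decplaces(unsigned char b, unsigned *slash,
                            unsigned *decplaces)
{
    *slash     = b / 16u;
    *decplaces = b % 16u;
}
```

A quarterly series starting in period 1 gives FreqPeriod = 16*4 + 1 = 65; a monthly series starting in December gives 16*12 + 12 = 204, which stays below the reserved value 255.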
If a zero occurs in the series, it is indicated by 32767 in the
differences. The following difference applies to the previous non-zero
number, not to the zero. This practice was adopted because some banks have
series with numerous missing observations which appear as zeroes. Also,
some banks consider quarterly series to be monthly series in which only
the end-of-quarter months have non-zero values.
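A decoder that follows this zero convention might look like the C sketch below. The function name `decode_series` and the macro `ZERO_MARK` are hypothetical, and the handling of a series whose first observation is itself zero is not covered, since the text does not describe it. The output is still in scaled-integer form; dividing by 10 to the power of the number of decimal places would recover the original values.

```c
#include <stddef.h>

#define ZERO_MARK 32767   /* difference value that stands for a zero */

/* Rebuild ndif+1 scaled observations from FirstObs and the stored
   differences. */
void decode_series(long first_obs, const short *diffs, size_t ndif,
                   long *out)
{
    long last = first_obs;   /* last non-zero level seen so far */
    size_t i;
    out[0] = first_obs;
    for (i = 0; i < ndif; i++) {
        if (diffs[i] == ZERO_MARK) {
            out[i + 1] = 0;          /* zero; last is left unchanged */
        } else {
            last += diffs[i];        /* applies to the last non-zero value */
            out[i + 1] = last;
        }
    }
}
```

For example, FirstObs = 100 with differences (5, 32767, 32767, 10) decodes to 100, 105, 0, 0, 115: the final difference of 10 is added to 105, not to the zeroes.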
If it was not possible to compress the series, the format is:
item            size in bytes     C type
==============================================================
BaseYear        1                 unsigned char
FreqPeriod      1                 unsigned char
255             1                 unsigned char
nobs            2                 int
Observations    4*nobs            float
Note that the 255 in the third byte is the signal that the series is not
compressed. The next two bytes represent the number of observations, and
then follow the observations as 4-byte floating point numbers.
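The two layouts imply a simple rule for the number of bytes a stored series occupies, sketched here in C. The function name `stored_series_bytes` is hypothetical, and the sketch ignores the two extra integers that follow FreqPeriod when it equals 255.

```c
/* Bytes occupied by one stored series, given its third header byte and
   the 2-byte count that follows it. */
unsigned long stored_series_bytes(unsigned char third_byte, unsigned count)
{
    if (third_byte == 255)
        return 5UL + 4UL * count;   /* 3 + 2 header bytes, count = nobs */
    return 9UL + 2UL * count;       /* 3 + 2 + 4 header bytes, count = NDif */
}
```

A compressed series with 46 differences takes 9 + 2*46 = 101 bytes; an uncompressed series with 10 observations takes 5 + 4*10 = 45 bytes.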
The compressed form can represent a series as accurately as can an 18-
foot-high graph printed with laser-printer resolution of 300 dots
per inch. (All series in the US National Accounts or Industrial
Production Indexes compress easily. In the Blue Pages of the Survey of
Current Business, however, nearly ten percent fail to compress. In IMF
data, the hyperinflation of many third-world countries produces series
which fail to compress.)
Hashed banks differ from compressed banks mainly in the organization
of their index files. With standard and compressed banks, G keeps the
names of the series in memory and simply does a linear search for a
name each time one is requested. In the hashed banks, the names are
grouped into bins on the basis of a number calculated from the letters
of the name. When a name is requested, G calculates the number,
locates the bin in which the name has been put, reads in the names in
that bin, and does a linear search over only those names to find the
desired series. The size of compressed banks is limited by the
requirement that the total number of characters in the names of all
series must be less than 64,000. In practice, that limit translates to
about six or seven thousand series. Hashed banks, in contrast, can go
up to several million series. The effective constraint becomes the
size of the user's hard disk. The data file has the extension "hbk";
and the index, "hin".
The precise form of the hashed bank .hin and .hbk files is as follows:
The ".hin" file contains:
item                size in bytes       C type
==============================================================
ns                  4                   long
nbins               2                   unsigned
nsb                 2*nbins             unsigned
ncharb              2*nbins             unsigned
posbin              4*nbins             unsigned long
binname(0)          ncharb[0]           char
binposts(0)         4*nsb[0]            long
binname(1)          ncharb[1]           char
binposts(1)         4*nsb[1]            long
binname(2)          ncharb[2]           char
binposts(2)         4*nsb[2]            long
.
.
binname(nbins-1)    ncharb[nbins-1]     char
binposts(nbins-1)   4*nsb[nbins-1]      long
Here, ns denotes the number of series in the bank. The series are
separated into "bins". The number of bins in the bank is denoted by
nbins. The number of series in each bin is given by the array nsb.
The sum of the number of characters in the names (including each '\0')
of the series contained in each bin is given by the array ncharb.
The beginning positions in the ".hin" file of the first bytes of the
binname() strings are given by the array posbin. The string binname(i)
denotes the concatenation (including the \0's) of all the series names
in the i-th bin. Finally, binposts(i) denotes the array of beginning
positions in the associated ".hbk" file of the series in the i-th bin.
Of course, the ordering of the series in the binname() and binposts()
arrays must be the same.
Consider an example. Suppose that bin number 3 contains the series
"joe", "dave", and "bill". The string binname(3) would be
"joe\0dave\0bill\0"
Suppose that the starting positions in the ".hbk" bank for the three
series are 40700008, 490987, 3378294. The array binposts(3) would then
be [40700008, 490987, 3378294]. And nsb[3] = 3, and ncharb[3] = 14.
If the beginning position of binname(3) in the ".hin" file is 4724, then
posbin[3] = 4724.
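If the items of the ".hin" file are written back to back in the order listed, with no padding (an assumption, though it follows from the sizes in the table), the posbin values can be derived from nsb and ncharb alone. The C sketch below does that; the function name `layout_hin` is hypothetical.

```c
#include <stddef.h>

/* Fill posbin[i] with the byte at which binname(i) starts and return the
   total size of the .hin file in bytes. */
unsigned long layout_hin(const unsigned *nsb, const unsigned *ncharb,
                         size_t nbins, unsigned long *posbin)
{
    /* ns (4) + nbins (2), then nsb and ncharb (2 bytes per bin each)
       and posbin (4 bytes per bin) */
    unsigned long offset = 4UL + 2UL + 8UL * (unsigned long)nbins;
    size_t i;
    for (i = 0; i < nbins; i++) {
        posbin[i] = offset;
        offset += ncharb[i];        /* binname(i)  */
        offset += 4UL * nsb[i];     /* binposts(i) */
    }
    return offset;
}
```

For a toy bank with two bins, where bin 0 holds "joe", "dave" and "bill" (nsb = 3, ncharb = 14) and bin 1 holds one three-letter name (nsb = 1, ncharb = 4), the header is 6 + 8*2 = 22 bytes, so posbin = (22, 48) and the file is 56 bytes long.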
To assign a bin number to a series you must use the following hashing
routine. In C, the routine is:
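extern unsigned nbins;      /* number of bins in the bank, set elsewhere */
unsigned hash(char *s);
unsigned hash(char *s)
{
    unsigned bill;
    /* fold the characters of the name together, weighting by 31 */
    for (bill=0;*s!='\0';s++) bill = *s + 31*bill;
    bill = bill%nbins;      /* reduce to a bin number in 0..nbins-1 */
    return(bill);
}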
To continue with the example, to determine the bin to which the series
"joe" actually belongs, evaluate the function hash("joe").
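One portability point deserves a sketch. The tables in this document give "int" and "unsigned" as 2 bytes, which suggests the routine was compiled with a 16-bit unsigned, so the running sum would wrap at 65536. On a compiler with a 32-bit unsigned the same source can yield a different bin for names of three or more characters. The C version below forces the 16-bit wraparound explicitly; treat that as an assumption about the original build, not a documented fact. The name `hash16` and the explicit nbins parameter are hypothetical.

```c
#include <stdint.h>

/* Same recurrence as hash(), with the accumulator held to 16 bits. */
unsigned hash16(const char *s, unsigned nbins)
{
    uint16_t h = 0;
    for (; *s != '\0'; s++)
        h = (uint16_t)((unsigned char)*s + 31u * h);   /* wraps at 65536 */
    return h % nbins;
}
```

For "joe" the unreduced sum is 101 + 31*(111 + 31*106) = 105408. Wrapped to 16 bits that is 39872, and with nbins = 101 the bin is 39872 % 101 = 78. Without the wrap, 105408 % 101 = 65, so a reader of existing banks needs to match whichever width was used to write them.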
The .hbk file:
0 - 79        char   Name of bank (terminated with a null)
80 - 81       int    ns, number of series in the bank
82 - 85       long   psn, position in file of index
86 -                 first series, as described below
*(psn+1) -           second series,
...                  ...
psn           long   position in file of first byte of first series
psn+4         long   position in file of first byte of second series
...           ...    on out to ns series
For each series, the format is:
byte     Content
0        base year
1        frequency*16+period
2        slash*16+maxplaces or 255 if not compressed
3-4      number of observations
5-8      first observation as a long
9 -      differences as integers
         if not compressed, floats begin in byte 5
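A reader for the fixed 86-byte header and the compression flag could be sketched in C as below. The struct and function names are hypothetical. The sketch assumes little-endian byte order, as on the PC where these banks were written, and reads ns as an unsigned 16-bit value for simplicity.

```c
#include <stdint.h>
#include <string.h>

struct hbk_header {          /* hypothetical in-memory form */
    char     title[80];
    uint16_t ns;             /* bytes 80-81 */
    uint32_t psn;            /* bytes 82-85 */
};

/* Assemble little-endian integers byte by byte, so the code does not
   depend on the byte order of the machine running it. */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* buf must hold at least the first 86 bytes of the .hbk file. */
void parse_hbk_header(const unsigned char *buf, struct hbk_header *h)
{
    memcpy(h->title, buf, 80);
    h->title[79] = '\0';             /* guard against a missing null */
    h->ns  = le16(buf + 80);
    h->psn = le32(buf + 82);
}

/* series points at byte 0 of one stored series. */
int series_is_uncompressed(const unsigned char *series)
{
    return series[2] == 255;
}
```

After the header is parsed, the index of series positions can be read starting at byte psn, four bytes per series.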